(54) Method and apparatus for providing failure detection and recovery with predetermined 
degree of replication for distributed applications in a network 



(57) An application module (A) running on a host 
computer in a computer network is failure-protected with 
one or more backup copies that are operative on other 
host computers in the network. In order to effect fault 
protection, the application module registers itself with a 
ReplicaManager daemon process (112) by sending a 
registration message, which message, in addition to 
identifying the registering application module and the 
host computer on which it is running, includes the par- 
ticular replication strategy (cold backup, warm backup, 
or hot backup) and the degree of replication associated 
with that application module. The backup copies are 
then maintained in a fail-over state according to the reg- 
istered replication strategy. A WatchDog daemon (113), 
running on the same host computer as the registered 
application, periodically monitors the registered application
to detect failures. When a failure, such as a crash
or hangup of the application module, is detected, the failure
is reported to the ReplicaManager, which effects the
requested fail-over actions. An additional backup copy
is then made operative in accordance with the registered
replication style and the registered degree of replication.
A SuperWatchDog daemon process (115-1),
running on the same host computer as the ReplicaManager,
monitors each host computer in the computer network.
When a host failure is detected, each application
module running on that host computer is individually
failure-protected in accordance with its registered
replication style and degree of replication.
Description 
Technical Field 

[0001] This invention relates to detection of a failure 
of an application module running on a host computer on 
a network and recovery from that failure. 

Background of the Invention 

[0002] In order for an application module running on 
a host computer in a network to provide acceptable per- 
formance to the clients accessing it, the application 
module must be both reliable and available. In order to 
provide acceptable performance, schemes are required 
for detecting the failure of an application module or the 
entire host computer running it, and for then quickly re- 
covering from such a detected failure. Replication of the 
application module on other host computers in the net- 
work is a well known technique that can be used to im- 
prove reliability and availability of the application mod- 
ule. 

[0003] Three strategies are known in the art for oper- 
ating and configuring the fail-over process as it applies 
to the replicas, or backup copies, of an application mod- 
ule and which define a state of preparedness for these 
backups. In the first strategy, known as a "cold backup" 
style, only the primary copy of an application module is 
running on a host computer and other backup copies 
remain idle on other host computers in the network. 
When a failure of the primary copy of the application 
module is detected, the primary copy of the application 
module is either restarted on the same host computer, 
or one of the backup copies of the application module 
is started on one of the other host computers, which 
backup then becomes the new primary. By using a 
checkpointing technique to periodically take "snap- 
shots" of the running state of the primary application 
module, and storing such state in a stable storage me- 
dia, when a failure of the primary application module is 
detected, the checkpoint data of the last such stored 
state of the failed primary application module is supplied 
to the backup application module to enable it to assume 
the job as the primary application module and continue 
processing from such last stored state of the failed pri- 
mary application module. 

[0004] The second strategy is known as a "warm 
backup" style. Unlike the cold backup style in which no 
backup of an application module is running at the same 
time the primary application module is running, in the 
warm backup style one or more backup application 
modules run simultaneously with the primary application 
module. The backup application modules, however, do 
not receive and respond to any client requests, but
periodically receive state updates from the primary
application module. Once a failure of the primary application
module is detected, one of the backup application modules
is quickly activated to take over the responsibility
of the primary application module without the need for
initialization or restart, either of which would increase
the time required for the backup to assume the processing
functions of the failed primary.

[0005] The third strategy is known as a "hot backup" 
style. In accordance with this style, two or more copies 
of an application module are active at run time. Each 
running copy can process client requests and states are 
synchronized among the multiple copies. Once a failure
in one of the running application modules is detected,
any one of the other running copies is able to immedi- 
ately take over the load of the failed copy and continue 
operations. 

[0006] Unlike the cold backup strategy in which only
one primary is running at any given time, both the warm
backup and hot backup strategies advantageously can 
tolerate the coincident failure of more than one copy of 
a particular application module running in the network, 
since multiple copies of that application module type are
simultaneously running on the network.

[0007] Each of the three replication strategies incurs
a different run-time overhead and has a different recovery
time. One application module running on a network
may need a different replication strategy, based on its
availability requirements and its run-time environment,
than another application module running on the same
host computer or a different host computer within the
network. Since distributed applications often run on
heterogeneous hardware and operating system platforms,
the techniques to enhance an application module's
reliability and availability must be able to accommodate
all the possible replication schemes.

[0008] In U.S. Patent 5,748,882 issued on May 5,
1998 to Y. Huang, a co-inventor of the present invention,
which patent is incorporated herein by reference, an
apparatus and a method for fault tolerant computing is
disclosed. As described in that patent, an application or
process is registered with a "watchdog" daemon which
then "watches" the application or process for a failure
or hangup. If a failure or hangup of the watched
application is detected, then the watchdog restarts the
application or process. In a multi-host distributed system on
a network, a watchdog daemon at a host computer
monitors registered applications or processes on its own
host computer as well as applications or processes on
another host computer. If a watched host computer fails,
the watchdog daemon that is watching the failed host
computer restarts the registered processes or applications
that were running on the failed watched node on
its own node. In both the single node and multiple node
embodiments, the replication strategy for restarting the
failed process or application is the cold backup style,
i.e., a new replica process or application is started only
upon the failure of the primary process or application.

[0009] Disadvantageously, prior art fault-tolerant
methodologies have not considered and are not adaptable
to handle multiple different replication strategies,
such as the cold, warm and hot backup styles described
above, that might best be associated with each individual
application among a plurality of different applications
that may be running on one or more machines in a net- 
work. Furthermore, no methodology exists in the prior 
art for maintaining a constant number of running appli- 
cations in the network for the warm and hot backup rep- 
lication styles. 

Summary of the Invention 

[0010] In accordance with the present invention, an 
application module running on a host computer is made 
reliable by first registering itself for its own failure and 
recovery processes. A ReplicaManager daemon proc- 
ess, running on the same host computer on which the 
application module is running or on another host com- 
puter connected to the network to which the application 
module's machine is connected, receives a registration 
message from the application module. This registration 
message, in addition to identifying the registering appli- 
cation module and the host machine on which it is run- 
ning, includes the particular replication strategy (cold, 
warm or hot backup style) and the degree of replication 
to be associated with the registered application module, 
which registered replication strategy is used by the Rep- 
licaManager to set the operating state of each backup 
copy of the application module as well as to maintain 
the number of backup copies in accordance with the de- 
gree of replication. A Watchdog daemon process, run- 
ning on the same host computer as the registered ap- 
plication module then periodically monitors the regis- 
tered application module to detect failures. When the 
Watchdog daemon detects a crash or a hangup of the 
monitored application module, it reports the failure to the 
ReplicaManager, which in turn effects a fail-over proc- 
ess. Accordingly, if the replication style is warm or hot 
and the failed application module cannot be restarted 
on its own host computer, one of the running backup 
copies of the primary application module is designated 
as the new primary application module and a host com- 
puter on which an idle copy of the application module 
resides is signaled over the network to execute that idle 
application. The degree of replication is thus maintained 
thereby assuring protection against multiple failures of 
that application module. If the replication style is cold 
and the failed application module cannot be restarted on its 
own host computer, then a host computer on which an 
idle copy of the application module resides is signaled 
over the network to execute the idle copy. In order to 
detect a failure of a host computer or the Watchdog daemon
running on a host computer, a SuperWatchDog
daemon process, running on the same host computer 
as the ReplicaManager, detects inputs from each host 
computer. Upon a host computer failure, detected by the 
SuperWatchDog daemon by the lack of an input from 
that host computer, the ReplicaManager is accessed to 
determine the application modules that were running on 
that host computer. Those application modules are then
individually failure-protected in the manner established
and stored in the ReplicaManager.

Brief Description of the Drawing 

[0011]

FIG. 1 is a block diagram of a computer network
illustratively showing a plurality of host computers
running application modules which are failure-protected
in accordance with the present invention;
and

FIG. 2 shows a table stored in the ReplicaManager
daemon, running on a host computer in the network
in FIG. 1, that associates, for each type of application
module, information used to effect failure protection
in accordance with the present invention.

Detailed Description 



[0012] With reference to FIG. 1, a network 100 is
shown, to which is connected a plurality of host computers.
The network 100 can be an Ethernet, an ATM network,
or any other type of data network. For illustrative
purposes only, six host computers H1, H2, H3, H4, H5
and H6, numerically referenced as 101, 102, 103, 104,
105, and 106, respectively, are connected to the network
100. Each host computer has a plurality of different
application modules residing in its memory. These
application modules, being designated in FIG. 1 as being
of a type A, B and C, each has a primary copy executed
and running on at least one of these six host computers.
Specifically, in this illustrative example, a primary copy
of the type A application module, application module A1,
is running on host computer H1, a primary copy of the
type B application module, application module B1, is
running on host computer H4, and a primary copy of the
type C application module, application module C1, is
running on host computer H3. Other copies of each type
of application module, as will be described, are either
stored and available from memory on at least one of the
other host computers in an idle state awaiting later
execution, or are running as backup copies or second
primary copies of application modules.

[0013] As previously described, an application module
running on a host computer is fault-protected by one
or more backup copies of the application module that
are operated in a state of preparedness defined by one
of three known replication styles. Each replication style
has its own method of providing backup to an application
module which fails by means of crashing or hanging up,
or to all those application modules residing on a host
computer that itself fails. In accordance with the present
invention, each application module type is fault-protected
with the specific replication style (cold backup, warm
backup, hot backup) that is best suited to its own
processing requirements. Furthermore, in accordance
with the present invention, each application module type
is fault-protected with a degree of replication specified
for that application module, thereby maintaining a
constant number of copies of that application module in a
running state for protection against multiple failures of
that type of application module.

EP 0 981 089 A2

[0014] In order for an idle or backup application mod- 
ule to assume the functioning of a failed primary appli- 
cation module upon failure-detection with a minimum of 
processing disruption, the last operating state of the 
failed application module must be provided to the back- 
up or idle application module upon its execution from 
the idle state or upon its being designated as the new 
primary application module. A Checkpoint Server 110 
connected to network 100 periodically receives from
each fault-protected application module running on the
network the most current state of that application, which
state is then stored in its memory. Upon failure detection
of an application module, the last stored state of that 
failed application module is retrieved from the memory 
of Checkpoint Server 110 and provided to the new pri- 
mary application module for continued processing. 
[0015] In accordance with the present invention, an
application module is made reliable by registering itself
for its own failure detection and recovery. Specifically, a
centralized ReplicaManager daemon process 112 running
on one of the host computers (host computer H2
in FIG. 1) in the network receives a registration request
from each failure-protected application module. The
registration request includes, for the particular application
module, the style of replication (i.e., hot, warm, or
cold), the degree of replication, a list of the host computers
on which the application module resides and
where on each such host computer the executable program
can be found, and a switching style. The degree
of replication specifies the total number of copies of an
application module. Thus, for a hot or warm replication
style, the degree of replication defines the total number
of running copies of an application module that are to
be maintained in the network. For a cold replication
style, the degree of replication specifies the number of
host computers in the network from which the application
module can be run. The switching style specifies a
fail-over strategy that determines when an application
module should be migrated from one host computer to
another host computer. With respect to the latter, when
a failure of an application module is detected, it can either
be restarted on the same host computer on which the
failure took place, or it can be migrated to another host
computer on which an idle or running backup copy resides.
Two fail-over strategies can be specified upon
registration of the application module with the
ReplicaManager. With the first, known as OnOverThreshold, an
application module is migrated to another host computer
after the number of times that the application module
has failed on a given host computer exceeds a given
threshold. Thus, with this strategy, the failed application
module is restarted on its own host computer until the
number of times the application module fails reaches the
threshold number. Thereafter, the failed application
module is migrated to another host computer. With the
second fail-over strategy, known as OnEachFailure, a
failed application module is migrated to another host
computer each time a failure occurs.
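
The registration request and the two switching styles described above can be illustrated with a short Python sketch. This is only an illustration under stated assumptions: the class and field names (`Registration`, `locations`, `threshold`) and the function `should_migrate` are hypothetical and do not appear in the specification, and the choice to migrate once the failure count reaches the threshold is one reading of the text.

```python
from dataclasses import dataclass
from enum import Enum


class ReplicationStyle(Enum):
    COLD = "cold"
    WARM = "warm"
    HOT = "hot"


class SwitchingStyle(Enum):
    ON_EACH_FAILURE = "OnEachFailure"
    ON_OVER_THRESHOLD = "OnOverThreshold"


@dataclass
class Registration:
    """Hypothetical layout of a registration request (field names assumed)."""
    module_type: str
    style: ReplicationStyle
    degree: int
    locations: dict  # host name -> path of the executable on that host
    switching: SwitchingStyle
    threshold: int = 0  # only meaningful for OnOverThreshold


def should_migrate(reg: Registration, failures_on_host: int) -> bool:
    """Return True if the failed copy should move to another host."""
    if reg.switching is SwitchingStyle.ON_EACH_FAILURE:
        return True
    # OnOverThreshold: restart locally until the failure count on this
    # host reaches the threshold (assumed boundary), then migrate.
    return failures_on_host >= reg.threshold
```

With a threshold of three, such a module would be restarted in place after its first and second failures on a host and migrated after the third.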

[0016] The ReplicaManager daemon process 112 has
consolidated in its memory the replication information
for all registered application modules in the network. For
each type of application module running in the network,
the ReplicaManager stores the information necessary
to effect recovery of a running application module or an
entire host computer running several different application
modules. FIG. 2 illustrates in a table format 200 the
type of stored information for the three types of application
modules running on the six host computers in FIG.
1. As an example, application module of type A is
registered in entry 201 with a warm backup style with a
replication degree of three. Thus one primary application
module is always running together with two backup copies,
with any one of the backup copies being capable of
taking over functioning as a primary upon the failure of
the primary copy. As can be noted in FIGS. 1 and 2, the
primary copy (designated "P" in block 202), A1, is
illustratively shown running on H1 and backup copies
(designated "B" in blocks 203 and 204), A2 and A3, are
shown running on H2 and H3, respectively. An additional
copy of application module type A, A4, is shown
residing in memory on H4 in an idle state (designated "I"
in block 205). The pathname location of each copy of
the application module on the host computer is
illustratively shown. Application module type B is registered
and stored by the ReplicaManager in entry 206 with a
hot backup style having a degree of two. Thus, two
primary copies of this application module are maintained
active and running, each processing client requests and
synchronizing states between each other. The first
primary copy, B1, is illustratively shown as residing on H4
and the second primary copy, B2, is shown residing on
H1. An idle copy, B3, resides on H5. The third application
module, type C, is registered in entry 207 with a cold
backup style with a degree of two. Thus, a primary copy,
C1, is illustratively shown running on H3, and a single
idle copy is illustratively shown residing on H6.
[0017] As will be discussed, upon detecting a failure
of a primary application module having an OnEachFailure
switching style or an OnOverThreshold switching
style in which the threshold has been reached, a backup
application module is designated as a new primary
application module in table 200. If the failed application
module has a warm or hot backup style, an idle copy of
that application module type is executed on its hosting
computer to maintain the same level of replication in the
network. Similarly, if a running backup copy of an
application module is detected as having failed, an idle copy
of that application module is started on another host
computer to maintain the same number of running copies
in the network as specified by the registered degree
of replication. Further, as will be discussed, upon detecting
a failure of a host computer, table 200 is accessed
to determine the identities of the application modules
running on that computer as either primary copies or
backup copies. Each such primary or backup copy on
the failed host computer is then failure-protected in the
same manner as if each failed individually.
[0018] With reference back to FIG. 1, failure detection
is effected through a WatchDog daemon process running
on each host computer. Each such WatchDog daemon
performs the function, once an application module
has been registered with the ReplicaManager 112, of
monitoring that running application module and all other
registered and running application modules on its host
computer. Accordingly, WatchDog daemon 113-1 monitors
the registered application modules A1 and B2 running
on host computer H1; WatchDog daemon 113-2
monitors the registered application module A2 running
on host computer H2; WatchDog daemon 113-3 monitors
the registered application modules A3 and C1 running
on host computer H3; and WatchDog daemon
113-4 monitors the application module B1 running on
host computer H4. Since application module A4 in memory
in host computer H4 is idle, WatchDog daemon
113-4 does not monitor it until it may later be made active.
Similarly, idle application module B3 on host computer
H5 and idle application module C2 on host computer
H6 are not monitored by WatchDog daemons
113-5 and 113-6, respectively, until they are executed.
[0019] The WatchDog daemons 113 running on each
host computer support two failure detection mechanisms:
polling and heartbeat. In polling, the WatchDog
daemon periodically sends a ping message to the
application module it is monitoring. If the ping fails, it
assumes that the application module has crashed. The
polling can also be used to provide a sanity check for
an application module by calling a sanity-checking method
inside the application module. In the heartbeat mechanism,
an application module actively sends heartbeats
to the WatchDog daemon either on a periodic basis or
on a per request basis. If the WatchDog daemon does
not receive a heartbeat within a certain duration, the
application module is considered to be hung up. The
heartbeat mechanism is capable of detecting both crash and
hang failures of an application module or a host computer,
whereas the polling mechanism is only capable of
detecting crash failures. An application module may
select one of these two approaches based on its reliability
needs.
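
The two detection mechanisms can be sketched as follows. This is a minimal Python illustration, not the WatchDog implementation: the names `poll` and `HeartbeatMonitor`, the use of a caller-supplied `ping` callable, and the explicit `now` timestamps are all assumptions made so the example is self-contained and deterministic.

```python
def poll(ping) -> bool:
    """Polling: return True if the module answered the ping.

    `ping` is a hypothetical callable that contacts the module; a raised
    OSError (for example a refused connection) is treated as a crash.
    """
    try:
        return bool(ping())
    except OSError:
        return False


class HeartbeatMonitor:
    """Heartbeat: modules report in; silence beyond `timeout` means hung."""

    def __init__(self, timeout: float):
        self.timeout = timeout
        self.last_seen = {}  # module name -> time of last heartbeat

    def heartbeat(self, module: str, now: float) -> None:
        # Record the arrival time of a heartbeat from `module`.
        self.last_seen[module] = now

    def hung(self, now: float) -> list:
        # Every module whose last heartbeat is older than the timeout.
        return sorted(
            m for m, t in self.last_seen.items() if now - t > self.timeout
        )
```

In this sketch a module that has crashed also stops sending heartbeats, which is why the heartbeat approach covers both failure kinds while a ping that still succeeds cannot reveal a hang.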

[0020] When a WatchDog daemon detects a crash or
a hang of an application module that it is "watching", it
reports the failure to the ReplicaManager 112 for fail-over
action. As previously noted, if the failed application
module has registered with an OnEachFailure fail-over
strategy, the failed application module is migrated to
another host. Thus, if the failed application module is a
primary copy, one of the backup application modules is
designated as the new primary and an idle application
module is executed to maintain the same degree of
replication for which that application module type has
registered. Upon promotion of an application module from
backup status to primary status, its designation in table
200 is modified, as is that of the idle application that is
executed. If the failed application module is a backup copy,
then an idle copy is executed and its designation in table
200 is modified to reflect that change.
[0021] As noted in FIG. 1, ReplicaManager 112 is
centralized, i.e., there is only one copy of ReplicaManager
running in the network. The replication information for
each application module running in the network is
consolidated in table 200 maintained in the memory of
ReplicaManager 112. To prevent loss of this information in
case of failures, this ReplicaManager table is
checkpointed with Checkpoint Server 110.

[0022] In addition to the functionality of the WatchDog
daemons running on each host computer, a centralized
SuperWatchDog daemon process 115-1 is used to detect
and recover from host crashes. All WatchDog daemons
register with the SuperWatchDog daemon for
such detection of host failures. Failure protection is
effected through a heartbeat detection strategy. Thus,
each of the WatchDog daemons 113 periodically sends
a heartbeat to the SuperWatchDog daemon 115-1. If the
SuperWatchDog daemon 115-1 does not receive a
heartbeat from any one of the WatchDogs 113, it assumes
that that WatchDog and the host computer on which it
is running have failed. It then initiates failure recovery by
informing the ReplicaManager 112 of that host computer's
failure. Since a centralized SuperWatchDog daemon
could itself become a single point of failure, it is
itself replicated and the replicas are maintained in a
warm replication style. In FIG. 1, SuperWatchDog backup
copies 115-2 and 115-3 of SuperWatchDog 115-1 are
shown residing on host computers H5 and H6,
respectively. The three SuperWatchDog daemons form a
logical ring structure. Each SuperWatchDog daemon
periodically sends heartbeats to a neighbor SuperWatchDog.
Thus, in FIG. 1, the primary SuperWatchDog 115-1
periodically sends a heartbeat to SuperWatchDog
115-2, which, in turn, periodically sends a heartbeat to
SuperWatchDog 115-3, which, in turn, periodically
sends a heartbeat back to SuperWatchDog 115-1. If a
SuperWatchDog does not receive a heartbeat from its
neighbor on the ring, it assumes that a failure has
occurred. A fail-over procedure for a failed SuperWatchDog
is described hereinafter.

[0023] As an example of recovery from a crashed or
hung application module, reference will be made to
application module A, which is registered with
ReplicaManager 112 with a warm replication style with a degree
of three and with a switching style of OnEachFailure.
Initially application module A1 is running on host computer
H1 with backups A2 and A3 running on host computers
H2 and H3, respectively. Application module A1
is registered with its local WatchDog 113-1 with the
detection style of polling, so that WatchDog 113-1
periodically polls application module A1. At some time,
application module A1 on host computer H1 crashes, which
failure is detected by WatchDog 113-1. WatchDog 113-1
reports that failure to ReplicaManager 112, which looks
up its internal table 200 and decides that a primary
application module of type A has failed and that backup
applications are running on host computers H2 and H3.
It promotes one of these backups (A2, for example) to
primary status and changes the status of A2 from backup
to primary in table 200. It then notes that an idle copy,
A4, is resident on host computer H4 at pathname location
/home/chung/A.exe, and starts that new backup by
informing the WatchDog 113-4 on host computer H4 to
execute that copy. Thus, a total of three copies of
application module A remain running in the network after
detection and recovery from the failure of application
module A1 on host computer H1, thereby maintaining the
number of running application modules in the network
at three, equal to the registered degree of replication.
The failure detection and recovery for a hung application
module will be exactly the same except that, in that case,
heartbeats, instead of polling, are used as a means for
failure detection.
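
The recovery sequence in this example (promote a running backup, then start an idle copy to restore the registered degree) can be sketched in Python. The function `fail_over`, the status letters stored per host, and the "failed" marker are hypothetical simplifications of table 200; only the warm style with OnEachFailure switching is modeled.

```python
def fail_over(copies, failed_host, degree):
    """Handle failure of the copy on `failed_host` for a warm-style module.

    `copies` maps host name -> status: "P" (primary), "B" (backup),
    "I" (idle). The dict is updated in place and returned.
    """
    was_primary = copies[failed_host] == "P"
    copies[failed_host] = "failed"  # assumed marker, not from the patent
    if was_primary:
        # Promote the first running backup to primary.
        for host in sorted(copies):
            if copies[host] == "B":
                copies[host] = "P"
                break
    # Start idle copies as backups until the degree is restored.
    for host in sorted(copies):
        running = sum(1 for s in copies.values() if s in ("P", "B"))
        if running >= degree:
            break
        if copies[host] == "I":
            copies[host] = "B"
    return copies
```

Applied to the type A entry, a crash of A1 on H1 leaves A2 on H2 as primary, A3 on H3 as backup, and the formerly idle A4 on H4 running as a new backup, so three copies are running again.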

[0024] The WatchDog running on each host computer
sends heartbeats to the primary SuperWatchDog in the
network. Thus, WatchDogs 113-1 through 113-6 send
heartbeats to SuperWatchDog 115-1. When a host crash
occurs, the WatchDog running on it crashes and
SuperWatchDog 115-1 stops receiving heartbeats from that
WatchDog. If, for example, host H1 crashes,
SuperWatchDog 115-1 stops receiving heartbeats from
WatchDog 113-1. It then declares host computer H1
dead and reports that failure to ReplicaManager 112.
ReplicaManager 112 accesses table 200 to determine
that application modules A1 and B2 were running on host
computer H1. Recovery for A1 is initiated as previously
described. Application module B2 is noted to be a
primary copy. The idle copy B3 residing on host computer
H5 is then executed, thereby maintaining two running
primary copies of application module type B in the
network. The status of B3 is then updated in table 200 from
idle to primary. The failure of a WatchDog daemon
running on a host computer is treated in the same manner
as a host crash.
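
The table lookup performed on a host failure can be sketched as below. The nested-dictionary layout and the function name `modules_on_host` are assumptions for illustration; the entries mirror the placement of copies described for FIG. 2.

```python
def modules_on_host(table, host):
    """Return (module_type, status) for every running copy on `host`.

    `table` maps module type -> {host name: status}, where status is
    "P" (primary), "B" (backup) or "I" (idle). Idle copies are skipped
    because they are not running and need no recovery.
    """
    return sorted(
        (module, copies[host])
        for module, copies in table.items()
        if copies.get(host) in ("P", "B")
    )


# Hypothetical encoding of table 200 as described in the text.
table_200 = {
    "A": {"H1": "P", "H2": "B", "H3": "B", "H4": "I"},
    "B": {"H4": "P", "H1": "P", "H5": "I"},
    "C": {"H3": "P", "H6": "I"},
}
```

For a crash of H1 the lookup yields the type A primary and a type B primary, each of which is then recovered individually according to its own registered style and degree.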

[0025] When the host computer on which a
SuperWatchDog daemon is running crashes, the
SuperWatchDog on the next host computer on the logical ring
stops receiving heartbeats. Thus, if host computer H6
fails, or SuperWatchDog 115-3 on that host computer
crashes, SuperWatchDog 115-1 on host computer H2 stops
receiving heartbeats from SuperWatchDog 115-3. It
declares SuperWatchDog 115-3 dead and checks to see
if the dead SuperWatchDog 115-3 was a primary
SuperWatchDog. Since SuperWatchDog 115-3 is a backup, it
does not need to take any action on behalf of that
SuperWatchDog. The SuperWatchDog 115-2 will then get
an exception when it tries to send its heartbeat to the
SuperWatchDog on host computer H6. As part of
exception handling, SuperWatchDog 115-2 determines
the handle for SuperWatchDog 115-1 on host computer
H2, registers itself with it and starts sending heartbeats
to it.
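
The rerouting of heartbeats around a dead ring member can be sketched as follows. The function `next_live` and the representation of the ring as an ordered list of daemon identifiers are assumptions; the specification describes the behavior only for the three-member ring of FIG. 1.

```python
def next_live(ring, sender, dead):
    """Return the next live member after `sender` in ring order.

    `ring` is the ordered list of SuperWatchDog identifiers and `dead`
    is the set of members known to have failed. Returns None when no
    other member is alive.
    """
    i = ring.index(sender)
    for step in range(1, len(ring)):
        candidate = ring[(i + step) % len(ring)]
        if candidate not in dead:
            return candidate
    return None
```

With the ring 115-1, 115-2, 115-3 and 115-3 marked dead, the heartbeat target of 115-2 becomes 115-1, matching the re-registration described above.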

[0026] If host computer H2 fails or SuperWatchDog
115-1 crashes, then SuperWatchDog 115-2 on host
computer H5 detects the failure and determines that the
primary SuperWatchDog has failed. Backup
SuperWatchDog 115-2 then takes over the role of the primary
and starts the ReplicaManager daemon on host computer
H5. The WatchDogs 113-1 through 113-6 on host computers
H1 through H6, respectively, get exceptions when they
attempt to send heartbeats to the SuperWatchDog
115-1 on host computer H2 (which was the primary). As
part of the exception handling routine, each WatchDog
daemon discovers the new primary SuperWatchDog
115-2, and the ReplicaManager 112 registers itself with
the new primary SuperWatchDog 115-2 and starts sending
it periodic heartbeats. Since only one copy of the
ReplicaManager daemon is running in the network, the
state of the ReplicaManager is made persistent by storing
the table 200 in the Checkpoint Server 110. Thus,
when the ReplicaManager is migrated to host computer
H5 with the new primary SuperWatchDog 115-2, the
ReplicaManager started on that host loads its state from
the Checkpoint Server 110 and reinitializes its internal
table from its stored state. Similarly, if the
ReplicaManager 112 fails, then its failure is detected by
SuperWatchDog 115-1 from the absence of heartbeats.
SuperWatchDog 115-1 then restarts ReplicaManager 112
on the same host computer, loading its state from the
Checkpoint Server 110, and reinitializing its internal
table 200 from its stored state.
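
Checkpointing the ReplicaManager table and reloading it after a restart can be sketched as a simple save and load round trip. The in-memory `CheckpointServer` class below is a stand-in assumed for illustration; the specification does not describe the server's interface or storage format, and JSON is used here only as a convenient serialization.

```python
import json


class CheckpointServer:
    """In-memory stand-in for Checkpoint Server 110 (interface assumed)."""

    def __init__(self):
        self._store = {}  # key -> serialized state

    def save(self, key, state):
        # Serialize so later changes to `state` do not alter the checkpoint.
        self._store[key] = json.dumps(state)

    def load(self, key):
        # Return a fresh copy of the last checkpointed state.
        return json.loads(self._store[key])
```

A restarted ReplicaManager would call `load` once at startup and rebuild its internal table from the returned value; any table changes made after the last `save` would be lost in this sketch.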

[0027] The above-described embodiment is illustrative
of the principles of the present invention. Other
embodiments may be devised by those skilled in the art
without departing from the scope of the present invention.



Claims 

1. A computer system for fault tolerant computing
comprising:

a plurality of host computers interconnected on
a network;

one or more copies of an application module
each running on a different one of said plurality
of host computers;

one or more idle backup copies of the application
module each stored on a different one of
said host computers;

a manager daemon process running on one of
said plurality of host computers, the manager
daemon process receiving an indication upon
a failure of one of said running copies of the
application module and initiating failure recovery;
and

means for providing a registration message to
said manager daemon process, said registration
message specifying said application module
and a degree of replication of said application
module, said degree of replication indicating
the number of running copies of the first
application module to be maintained in the system;

wherein the number of running copies of the
application module is maintained at the registered
degree of replication by executing at least one
of said idle backup copies upon detecting one
or more failures, respectively, of any of the
running copies of said application module.


2. The computer system of claim 1 further comprising:
a plurality of failure-detection daemon processes
each running on and associated with the host
computer on which each copy of the application
module is running, each of said failure-detection
daemon processes monitoring the ability of its
associated copy of the application module to continue
to run, each failure-detection daemon process
sending to said manager daemon process a message
indicating a failure of its associated copy of
the application module upon detecting its failure.

3. The computer system of claim 2 further comprising:
a checkpoint server connected to the network,
said checkpoint server periodically storing the
states of each of said running copies of said
application module and said manager daemon process.

4. The computer system of claim 3 wherein upon
detection of the failure of one of said running copies
of said application module, said manager daemon
process signals one of said at least one idle backup
copies to execute and to assume the processing
functions of the failed copy, said one backup copy
retrieving from said checkpoint server the last
stored state of the failed copy of the application
module.

5. The computer system of claim 3 further comprising:
a second failure-detection daemon process
running on the same host computer as the manager
daemon process, said second failure-detection
process monitoring a host computer on which one
of the copies of the application module is running
for a failure.

6. The computer system of claim 5 wherein upon
detection of a failure of the monitored host computer,
said manager daemon process signals one of said
idle backup copies to execute and to assume the
processing functions of the copy of the application
module running on the failed host computer, the
executed backup copy retrieving from said checkpoint
server the last stored state of the copy of the
application module running on the failed host computer.

7. The computer system of claim 5 further
comprising:

a backup copy of said second failure-detection
daemon process running on one of said plurality
of host computers other than the host computer
on which the second failure-detection daemon
process is running, said copy of said second
failure-detection process monitoring the host computer on
which the second failure-detection daemon process
is running for a failure.

8. The computer system of claim 7 wherein upon
detection of a failure of the host computer on which
the second failure-detection daemon process is
running, said backup copy of said second
failure-detection daemon process assumes the processing
functions of said second failure-detection daemon
process and initiates running of a copy of said
manager daemon process on its own host computer,
said copy of said manager daemon process retrieving
from said checkpoint server the last stored state
of said manager daemon process while it was
running on said second host computer.

9. The computer system of claim 1 wherein the regis- 
tration message for the application module further 
specifies a style of replication that indicates whether 
the replication style for the first application module 
is to be cold, warm or hot. 

10. The computer system of claim 4 wherein the regis- 
tration message for the application module further 
specifies a fail-over strategy, the fail-over strategy 
indicating whether one of said idle backup copies 
should assume the processing functions of a failed 
one of said running copies each time a failure of that 
one running copy is detected by said first failure- 
detection process, or whether said one of said idle 
backup copies should assume the processing func- 
tions of said one failed running copy only after the 
number of failures of that one copy of said applica- 
tion module reaches a predetermined threshold. 

11. A fault-managing computer apparatus on a host 
computer in a computer system, said apparatus 
comprising: 

a manager daemon process for receiving an in- 
dication of a failure of a copy of an application 
module running on a host computer in the com- 
puter system and for initiating failure recovery 
with at least one idle backup copy of the appli- 
cation module; and 

means for receiving a registration message 
specifying the application module and a degree 
of replication for the application module, said
degree of replication indicating the number of 
running copies of the application module to be 
maintained in the system; 
wherein the number of running copies of the ap- 
plication module in the system is maintained at 
the registered degree of replication by execut- 
ing one of the idle backup copies upon detect- 
ing a failure of one of the running copies of the 
application module. 

12. The apparatus of claim 11 wherein upon receiving 
an indication of a failure of one of the running copies 
of the application module said manager daemon 
process signals one of the idle backup copies to
assume the processing functions of the failed copy.

13. The apparatus of claim 11 further comprising a fail- 
ure-detection daemon process for monitoring each 
host computer in the system for a failure. 

14. The apparatus of claim 13 wherein upon said fail- 
ure-detection daemon process detecting a failure of 
one of the host computers on which a copy of the 
application module is running, said manager daemon
process signals one of said at least one idle
backup copies to assume the processing functions 
of the copy of the application module on the failed 
host computer. 

15. A fault-tolerant computing apparatus for use in a 
computer system, said apparatus comprising: 

a failure-detection daemon process running on 
said apparatus, said failure-detection daemon 
process monitoring the ability of a running copy 
of an application module to continue to run on 
said apparatus; and 

means for sending a registration message to a 
manager daemon process specifying the appli- 
cation module and a degree of replication to be 
maintained by the manager daemon process 
for the application module with respect to the 
number of running copies of the application 
module to be maintained in the system; 
wherein the number of running copies of the
application module in the system is maintained at
the registered degree of replication by execut- 
ing an idle backup copy of the application mod- 
ule on a different computing apparatus upon 
detecting a failure of the running copy of the ap- 
plication module. 

16. The apparatus of claim 15 wherein upon detecting 
a failure of the running copy of the application mod- 
ule on the apparatus, the idle backup copy of the 
application module is executed and assumes the 
processing functions of the failed copy. 



17. The apparatus of claim 15 wherein the registration
message further specifies a style of replication that
indicates that the application module is to be
replicated in the computer system with a cold, warm or
hot backup style.

18. A method for operating a fault-tolerant computer
system, said system comprising a plurality of host
computers interconnected on a network, one or
more copies of an application module each one
running on a different one of said plurality of host
computers, and one or more idle backup copies of the
application module each stored on a different one
of said host computers; said method comprising the
steps of:

receiving a registration message specifying the
first application module and a degree of
replication to be maintained for the application
module, said degree of replication indicating the
number of running copies of the application
module to be maintained in the system; and

executing at least one of the idle backup copies
upon detecting a failure of one of the running
copies of the application module to maintain the
total number of running copies of the application
module in the system at the registered
degree of replication.

19. The method of claim 18 further comprising the steps
of:

receiving an indication upon a failure of the one
of the running copies of the application module;
and

initiating failure recovery for the failed copy with
at least one of the idle backup copies.

20. The method of claim 18 further comprising the steps
of:

monitoring one of the host computers on which
a copy of the application module is running; and

upon detecting a failure of that host computer,
initiating failure recovery for the copy of the
application module on that host computer with one
of the idle backup copies.






21. The method of claim 18 wherein the registration
message for the application module further
specifies a style of replication that indicates whether the
replication style for the application module is to be
cold, warm or hot.

22. The method of claim 19 wherein the registration
message for the application module further
specifies a fail-over strategy, the fail-over strategy
indicating whether one of the idle backup copies should
assume the processing functions of a failed one of
the running copies each time a failure of that one
running copy is detected, or whether one of the idle
backup copies should assume the processing
functions of that one failed running copy only after the
number of failures of that one copy reaches a predetermined
threshold.
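
By way of illustration only, the maintenance of a registered degree of replication and the threshold-based fail-over strategy recited above can be sketched in Python. The sketch is not part of the claimed subject matter and rests on assumptions: the names `Manager`, `AppEntry`, `report_failure` and `report_host_failure` are hypothetical, a threshold of 1 is used to mean "fail over on every failure", and restarting the failed copy on its own host while the failure count is below the threshold is an assumed behavior.

```python
from dataclasses import dataclass, field


@dataclass
class AppEntry:
    degree: int                                   # registered degree of replication
    running: set = field(default_factory=set)     # hosts with a running copy
    idle: list = field(default_factory=list)      # hosts holding idle backup copies
    threshold: int = 1                            # failures before fail-over
    failures: dict = field(default_factory=dict)  # per-host failure counts


class Manager:
    """Toy manager daemon that keeps running copies at the registered degree."""

    def __init__(self):
        self.apps = {}

    def register(self, app, degree, running, idle, threshold=1):
        self.apps[app] = AppEntry(degree, set(running), list(idle), threshold)

    def report_failure(self, app, host):
        entry = self.apps[app]
        entry.failures[host] = entry.failures.get(host, 0) + 1
        if entry.failures[host] < entry.threshold:
            # Assumption: below the threshold the copy is retried in place.
            return ("restart", host)
        entry.running.discard(host)
        return self._restore_degree(entry)

    def report_host_failure(self, host):
        # Each application on the failed host is handled individually.
        actions = {}
        for app, entry in self.apps.items():
            if host in entry.idle:
                entry.idle.remove(host)   # a backup stored there is lost too
            if host in entry.running:
                entry.running.discard(host)
                actions[app] = self._restore_degree(entry)
        return actions

    def _restore_degree(self, entry):
        # Execute idle backups until the registered degree is reached again.
        promoted = []
        while len(entry.running) < entry.degree and entry.idle:
            new_host = entry.idle.pop(0)
            entry.running.add(new_host)
            promoted.append(new_host)
        return ("failover", promoted)
```

For example, with a degree of 2, running copies on H1 and H2, idle backups on H3 and H4 and a threshold of 2, the first reported failure on H1 yields a restart on H1, and the second yields a fail-over to H3.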